Frequency Sensitive Competitive Learning for Balanced Clustering on High-dimensional Hyperspheres
Abstract
Competitive learning mechanisms for clustering generally perform poorly on very high dimensional data because of “curse of dimensionality” effects. In applications such as document clustering, it is customary to normalize the high dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft EM-based version, spkmeans tends to generate extremely imbalanced clusters in high dimensional spaces when the desired number of clusters is large (tens or more). In this paper, we first show that the spkmeans algorithm can be derived from a maximum likelihood formulation that uses a mixture of von Mises-Fisher distributions as the generative model, and that it can in fact be considered a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency sensitive competitive learning variants that are applicable to static data and produce high quality, well balanced clusters for high-dimensional data. As with kmeans, each iteration of all three algorithms is linear in the number of data points and in the number of clusters. We also propose a frequency sensitive algorithm to cluster streaming data. Experimental results on clustering high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
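As a rough illustration of the batch spkmeans procedure described above (inputs and cluster centers both normalized to unit length, assignment by cosine similarity), here is a minimal sketch in plain Python. The function names, the random seeding, and the fixed iteration count are assumptions made for this example and are not taken from the paper. The frequency sensitive variants are not reproduced, since the abstract does not state their update rules.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spkmeans(points, k, iterations=20, seed=0):
    """Batch spherical k-means sketch (hypothetical helper, not the paper's code)."""
    rng = random.Random(seed)
    data = [normalize(p) for p in points]
    # Assumption: seed the centers with k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment step: each point goes to the center with the largest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(x, centers[j])) for x in data]
        # Update step: each center becomes the normalized sum of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                total = [sum(col) for col in zip(*members)]
                centers[j] = normalize(total)
    return labels, centers


def cluster_sizes(labels, k):
    """Count the points per cluster, a quick way to inspect how balanced the result is."""
    return [labels.count(j) for j in range(k)]


if __name__ == "__main__":
    toy = [[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]]
    labels, centers = spkmeans(toy, k=2)
    print(labels, cluster_sizes(labels, 2))
```

Each iteration touches every point once per center, which matches the abstract's statement that the cost is linear in the number of data points and in the number of clusters. Inspecting `cluster_sizes` on real data is one simple way to see the imbalance that motivates the frequency sensitive variants.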
Similar resources
Frequency Sensitive Competitive Learning for Clustering on High-dimensional Hyperspheres
This paper derives three competitive learning mechanisms from first principles to obtain clusters of comparable sizes when both inputs and representatives are normalized. These mechanisms are very effective in achieving balanced grouping of inputs in high dimensional spaces, as illustrated by experimental results on clustering two popular text data sets in 26,099 and 21,839 dimensional spaces r...
High-dimensional clustering using frequency sensitive competitive learning
In this paper a clustering algorithm for sparsely sampled high-dimensional feature spaces is proposed. The algorithm performs clustering by employing a distance measure that compensates for differently sized clusters. A sequential version of the algorithm is constructed in the form of a frequency sensitive Competitive Learning scheme. Experiments are conducted on an artificial Gaussian data set a...
High-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
Clustering with Balancing Constraints
In many applications of clustering, solutions that are balanced, i.e., where the clusters obtained are of comparable sizes, are preferred. This chapter describes several approaches to obtaining balanced clustering results that also scale well to large data sets. First, we describe a general scalable framework for obtaining balanced clustering which first clusters only a small subset of the data ...
Texture Segmentation by Frequency-Sensitive Elliptical Competitive Learning
In this paper a new learning algorithm is proposed with the purpose of texture segmentation. The algorithm is a competitive clustering scheme with two specific features: elliptical clustering is accomplished by incorporating the Mahalanobis distance measure into the learning rules, and underutilization of smaller clusters is avoided by incorporating a frequency-sensitive term. In the paper, an ...
Publication date: 2004